Data Cleaning

To see all of the cleaned data, click here.

Quarterly and Annual Ridership Totals by Mode of Transportation 1

The purpose of this data is to establish a baseline view of the current state of public transit usage in the United States. The data set should therefore be cleaned so that trends can be visualized without superfluous information unrelated to current phenomena. The steps used in cleaning this data are below.

  • Trim the data set:
    • Keep only columns 1 to 11, dropping blank columns in the .csv file as well as notes added by the source.
    • Keep only rows 81 to 133, removing records from before 2010, which are superfluous when comparing against current trends.
  • Create a single column for year and quarter to improve readability
  • Convert all numeric columns to the numeric data type
  • Remove the extra year and quarter columns, as they are now unnecessary
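
The numeric-conversion step can be illustrated in pandas as well; this is a minimal sketch on made-up figures, not the actual APTA file:

```python
import pandas as pd

# Toy frame standing in for the APTA export: ridership arrives as
# comma-separated strings (column names mirror the cleaned data set).
ridership = pd.DataFrame({
    "Year - Quarter": ["2010 - Q1", "2010 - Q2"],
    "total_ridership": ["2,512,345", "2,601,002"],
    "bus": ["1,204,118", "1,250,330"],
})

num_cols = ["total_ridership", "bus"]
# Strip the thousands separators, then cast the columns to integers.
ridership[num_cols] = (
    ridership[num_cols]
    .apply(lambda s: s.str.replace(",", "", regex=False))
    .astype("int64")
)
print(ridership["total_ridership"].tolist())  # [2512345, 2601002]
```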

I have chosen to keep all of the numeric fields for now, as each one can provide insight into which modes of transportation are most affected by particular factors. Below is the code applying the steps laid out above, along with a comparison between the raw and cleaned data sets.

Code
import pandas as pd
Code
library(tidyverse)
library(tidyr)

ridership <- read.csv("../data/APTA-Ridership-by-Mode-and-Quarter-1990-Present.csv")
# Keep rows 81-133 (2010 onward) and columns 1-11 (drops blank columns and source notes)
ridership <- ridership[81:133, 1:11]
colnames(ridership)[2] <- 'Year - Quarter'
colnames(ridership)[4:11] <- c("total_ridership", "heavy_rail", "light_rail", "commuter_rail",
                               "trolleybus", "bus", "demand_response", "other")
# Strip thousands separators and convert the ridership columns to numeric
num_cols <- colnames(ridership)[4:11]
ridership[num_cols] <- lapply(ridership[num_cols], function(x) as.numeric(gsub(",", "", x)))
ggplot(data = ridership, aes(x = factor(`Year - Quarter`), y = total_ridership, group = 1)) +
  geom_line() +
  geom_point() +
  labs(x = "Year - Quarter", y = "Total Ridership (000s)", title = "Total Public Transit Ridership in the U.S.") +
  theme(axis.text.x = element_text(angle = 45))
# Drop the now-redundant year and quarter columns and save
ridership <- ridership[c(2, 4:11)]
head(ridership)
write.csv(ridership, "../data/cleaned_data/ridership_by_quarter_cleaned.csv")

Raw Quarterly Ridership Data

Cleaned Quarterly Ridership Data

News API Data 2

The raw data comes as a JSON file in which each record corresponds to a particular article. The purpose of cleaning it is to enable analysis of word prevalence, which can be done by creating a corpus. The steps, recycled from DSAN-5000 Lab 2.1, are as follows:

  • Retrieve the raw data JSON file for WMATA news.
  • Create a string cleaning function to deal with punctuation, special characters, and differently cased letters
  • Iterate through each article
    • Iterate through each data point in an article to clean strings and append cleaned data to output list
  • Convert cleaned data to data frame
  • Create corpus from cleaned data
  • Use CountVectorizer to retrieve vocabulary for the data set
  • Repeat for BART
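
The string-cleaning step can be sketched on its own with toy input; the regex below mirrors the punctuation handling described above rather than the exact lab code:

```python
import re

def clean_string(text):
    """Lowercase, replace punctuation and special characters, collapse whitespace."""
    out = re.sub(r"[,.;@#?!&$’'\"-]+", " ", text)  # punctuation -> spaces
    out = re.sub(r"\s+", " ", out).strip()          # collapse runs of whitespace
    return out.lower()

print(clean_string("WMATA's ridership -- up 12%?!"))  # wmata s ridership up 12%
```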

Below is the code, along with images of the cleaned data.

Code
import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

#WMATA

x = open('../data/WMATA-newapi-raw-data.json')
response = json.load(x)

def string_cleaner(input_string):
    try:
        # Replace punctuation and special characters with spaces
        out = re.sub(r"""
                    [,.;@#?!&$-]+
                    \ *
                    """,
                    " ",
                    input_string, flags=re.VERBOSE)
        # Remove apostrophes/periods from the intermediate result (not the
        # original string, which would discard the substitution above),
        # collapse whitespace, and lowercase
        out = re.sub('[’.]+', '', out)
        out = re.sub(r'\s+', ' ', out)
        out = out.lower()
    except Exception:
        # Non-string fields (e.g. a missing author) land here
        print("ERROR")
        out = ''
    return out

article_list=response['articles']
article_keys=article_list[0].keys()
index=0
cleaned_data=[];  
for article in article_list:
    tmp=[]
    for key in article_keys:
        if(key=='source'):
            src=string_cleaner(article[key]['name'])
            tmp.append(src) 

        if(key=='author'):
            author=string_cleaner(article[key])
            if(src in author): 
                author='NA'
            tmp.append(author)

        if(key=='title'):
            tmp.append(string_cleaner(article[key]))

        if(key=='publishedAt'):
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date=article[key]
            if(not ref.match(date)):
                date="NA"
            tmp.append(date)

    cleaned_data.append(tmp)
    index+=1

df = pd.DataFrame(cleaned_data)
print(df.head())
df.to_csv('../data/cleaned_data/wmata_news_cleaned.csv', index=False)

corpus = df[2]
vectorizer=CountVectorizer()
word_counts  =  vectorizer.fit_transform(corpus)
print("vocabulary = ",vectorizer.vocabulary_)
with open('../data/cleaned_data/wmata_vocabulary.json', 'w') as outfile:
    json.dump(vectorizer.vocabulary_, outfile)

## BART
x = open('../data/BART-newapi-raw-data.json')
response = json.load(x)

article_list=response['articles']
article_keys=article_list[0].keys()
index=0
cleaned_data=[];  
for article in article_list:
    tmp=[]
    for key in article_keys:
        if(key=='source'):
            src=string_cleaner(article[key]['name'])
            tmp.append(src) 

        if(key=='author'):
            author=string_cleaner(article[key])
            if(src in author): 
                author='NA'
            tmp.append(author)

        if(key=='title'):
            tmp.append(string_cleaner(article[key]))

        if(key=='publishedAt'):
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date=article[key]
            if(not ref.match(date)):
                date="NA"
            tmp.append(date)

    cleaned_data.append(tmp)
    index+=1

df = pd.DataFrame(cleaned_data)
print(df.head())
df.to_csv('../data/cleaned_data/bart_news_cleaned.csv', index=False)

corpus = df[2]
vectorizer=CountVectorizer()
word_counts  =  vectorizer.fit_transform(corpus)
print("vocabulary = ",vectorizer.vocabulary_)
with open('../data/cleaned_data/bart_vocabulary.json', 'w') as outfile:
    json.dump(vectorizer.vocabulary_, outfile)
                            0                             1   
0     construction-physicscom                  brian potter  \
1               planetizencom                 diana ionescu   
2               planetizencom                 diana ionescu   
3  greater greater washington  eleanor barker (contributor)   
4  greater greater washington    miles wilson (contributor)   

                                                   2                     3  
0                    how washington dc got its metro  2023-09-28T03:40:20Z  
1             opinion: fare evasion is a red herring  2023-09-13T12:00:00Z  
2                     dc delays bus lane enforcement  2023-09-22T13:00:00Z  
3  breakfast links: wmata pursuing automated door...  2023-09-15T13:10:00Z  
4  breakfast links: wmata set to present three bu...  2023-09-27T12:44:00Z  
vocabulary =  {'how': 112, 'washington': 232, 'dc': 58, 'got': 103, 'its': 122, 'metro': 136, 'opinion': 160, 'fare': 84, 'evasion': 81, 'is': 119, 'red': 176, 'herring': 106, 'delays': 61, 'bus': 33, 'lane': 124, 'enforcement': 76, 'breakfast': 30, 'links': 129, 'wmata': 240, 'pursuing': 175, 'automated': 20, 'doors': 67, 'but': 35, 'fully': 97, 'operation': 159, 'delayed': 60, 'until': 230, 'next': 150, 'year': 242, 'set': 197, 'to': 219, 'present': 173, 'three': 214, 'budget': 32, 'scenarios': 191, 'thwart': 216, 'shortfall': 198, 'announces': 15, 'start': 202, 'date': 56, 'for': 90, '24': 2, 'hour': 110, 'metrobus': 137, 'service': 196, 'in': 115, '7000': 7, 'series': 195, 'trains': 222, 'not': 153, 'at': 19, 'fault': 86, 'others': 161, 'under': 228, 'inspection': 116, 'after': 9, 'friday': 94, 'derailment': 62, 'bridge': 31, 'has': 104, 'been': 24, 'established': 79, 'from': 95, 'dupont': 73, 'circle': 44, 'gallery': 100, 'place': 169, 'via': 231, 'farragut': 85, 'north': 152, 'and': 13, 'center': 42, 'the': 211, 'first': 89, 'time': 218, '50': 5, 'history': 107, 'looking': 131, 'revamp': 183, 'system': 209, 'simplify': 200, 'routes': 187, 'with': 238, 'better': 26, 'names': 146, 'campaign': 36, 'plans': 171, 'full': 96, 'scale': 190, 'security': 192, 'exercise': 82, 'capitol': 38, 'south': 201, 'station': 204, 'sunday': 208, 'sept': 194, '17': 0, 'huzzah': 113, 'all': 11, 'lost': 133, 'new': 149, 'found': 91, 'policy': 172, 'expands': 83, 'items': 121, 'will': 237, 'hold': 109, 'return': 182, 'customers': 54, 'weve': 236, 'seen': 193, 'least': 125, '70': 6, 'reduction': 177, 'those': 213, 'stations': 205, 'weekend': 235, 'track': 220, 'work': 241, 'maintenance': 134, 'on': 157, 'line': 127, 'close': 46, 'four': 92, 'be': 23, 'unavailable': 227, 'free': 93, 'shuttle': 199, 'buses': 34, 'replacing': 181, 'december': 59, '18': 1, 'through': 215, 'saturday': 189, '30': 3, 'normal': 151, 'most': 144, 'lines': 128, 'minute': 141, 'traveling': 225, 'around': 18, 
'national': 147, 'airport': 10, 'path': 167, 'one': 158, 'million': 140, 'rides': 184, 'day': 57, 'starts': 203, 'enhancements': 77, 'get': 101, 'we': 233, 'deserve': 63, 'transform': 223, 'parking': 163, 'into': 118, 'housing': 111, 'train': 221, 'details': 64, 'near': 148, 'amtraks': 12, 'battle': 22, 'more': 142, 'control': 51, 'of': 156, 'union': 229, 'continues': 50, 'loose': 132, 'brake': 29, 'bolts': 28, 'might': 138, 'have': 105, 'caused': 41, 'participate': 164, 'week': 234, 'without': 239, 'driving': 71, 'october': 155, 'much': 145, 'good': 102, 'can': 37, 'localism': 130, 'do': 66, 'downtown': 70, 'repairs': 179, '34': 4, 'federal': 87, 'funding': 98, 'tree': 226, 'equity': 78, 'awarded': 21, 'region': 178, 'council': 52, 'passes': 166, 'bike': 27, 'subsidy': 207, 'transit': 224, 'future': 99, 'if': 114, 'plan': 170, 'it': 120, 'morning': 143, 'notes': 154, 'ticket': 217, 'stops': 206, 'before': 25, 'army': 17, 'ten': 210, 'miler': 139, 'cause': 40, 'road': 185, 'closures': 47, 'rosslyn': 186, 'pentagon': 168, 'city': 45, 'this': 212, 'dulles': 72, 'hits': 108, 'passengers': 165, 'animais': 14, 'diplomáticos': 65, 'china': 43, 'pretende': 174, 'repatriar': 180, 'pandas': 162, 'dos': 69, 'eua': 80, 'em': 75, 'meio': 135, 'sanções': 188, 'econômicas': 74, 'commuter': 49, 'advocates': 8, 'cuts': 55, 'are': 16, 'likely': 126, 'coming': 48, 'cta': 53, 'fire': 88, 'dorval': 68, 'carter': 39, 'jr': 123, 'instead': 117}
ERROR
ERROR
                  0               1   
0  business insider      sabina wex  \
1          politico  jeremy b white   
2          politico     blake jones   
3     planetizencom   diana ionescu   
4     cleantechnica            nrdc   

                                                   2                     3  
0  ai sensors and wifi networks that provide work...  2023-09-28T19:55:01Z  
1  why a quick feinstein replacement is in newsom...  2023-09-29T19:34:46Z  
2  feinsteins friends and colleagues in californi...  2023-09-29T19:59:09Z  
3  post-pandemic travel patterns call for differe...  2023-09-21T17:00:00Z  
4  new transit service funding in california — ho...  2023-09-29T15:25:41Z  
vocabulary =  {'ai': 8, 'sensors': 152, 'and': 10, 'wifi': 189, 'networks': 110, 'that': 169, 'provide': 134, 'workplace': 192, 'insights': 87, 'are': 14, 'antidotes': 13, 'for': 68, 'some': 157, 'companies': 44, 'combating': 40, 'zombie': 195, 'offices': 118, 'why': 188, 'quick': 136, 'feinstein': 63, 'replacement': 140, 'is': 89, 'in': 84, 'newsoms': 113, 'best': 21, 'interest': 88, 'feinsteins': 64, 'friends': 72, 'colleagues': 38, 'california': 29, 'mourn': 105, 'true': 179, 'giant': 76, 'post': 131, 'pandemic': 122, 'travel': 178, 'patterns': 124, 'call': 30, 'different': 56, 'transit': 175, 'schedules': 148, 'new': 111, 'service': 153, 'funding': 74, 'how': 82, 'states': 160, 'can': 31, 'find': 66, 'more': 104, 'san': 144, 'franciscos': 70, 'transport': 176, 'agency': 7, 'metropolitan': 101, 'transportation': 177, 'commission': 43, 'mtc': 106, 'exposes': 62, 'drivers': 59, 'plate': 127, 'numbers': 115, 'addresses': 6, 'bay': 19, 'area': 15, 'rapid': 137, 'freeway': 71, 'controversy': 48, 'microsoft': 103, 'lists': 97, 'tuesday': 180, 'researchbuzz': 141, 'october': 116, '2023': 3, 'cityview': 37, 'stockbridge': 161, 'commence': 41, 'construction': 47, 'on': 119, '265': 5, 'unit': 181, 'multifamily': 107, 'development': 54, 'los': 98, 'angeles': 11, 'south': 159, 'column': 39, 'black': 23, 'woman': 191, 'to': 171, 'temporarily': 168, 'fill': 65, 'dianne': 55, 'seat': 150, 'don': 58, 'count': 49, 'it': 90, 'newsom': 112, 'another': 12, 'side': 154, 'of': 117, 'difi': 57, 'hike': 80, 'right': 142, 'from': 73, 'the': 170, 'city': 36, '24': 4, 'us': 182, 'metros': 102, 'with': 190, 'nearby': 109, 'trails': 173, 'get': 75, 'sonoma': 158, 'county': 50, 'santa': 145, 'rosa': 143, 'airport': 9, 'sts': 165, 'smart': 156, 'train': 174, 'connect': 46, 'person': 125, 'or': 120, 'groome': 77, '20': 2, 'joshua': 92, 'norton': 114, 'declares': 52, 'himself': 81, 'emperor': 60, '1859': 1, 'watching': 184, 'real': 138, 'estate': 61, 'bust': 28, 'streets': 163, 'francisco': 69, 
'luis': 99, 'porrello': 130, 'john': 91, 'fisher': 67, 'bolster': 24, 'wsp': 193, 'west': 185, 'leadership': 95, 'team': 167, 'peso': 126, 'pluma': 128, 'josé': 93, 'parking': 123, 'bag': 18, 'policy': 129, 'what': 186, 'know': 94, 'before': 20, 'heading': 79, 'sap': 146, 'center': 34, 'congressional': 45, 'progressive': 132, 'caucus': 33, 'pac': 121, 'backs': 17, 'candidate': 32, 'who': 187, 'signed': 155, 'deal': 51, 'demanding': 53, 'reparations': 139, 'hardly': 78, 'strictly': 164, 'schedule': 147, 'maps': 100, 'you': 194, 'bring': 26, 'influx': 86, 'asylum': 16, 'seekers': 151, 'puts': 135, 'strain': 162, 'border': 25, 'immigration': 83, 'system': 166, 'comment': 42, 'utiliser': 183, 'chatgpt': 35, '10': 0, 'prompts': 133, 'incroyables': 85, 'top': 172, 'music': 108, 'business': 27, 'schools': 149, 'billboards': 22, 'list': 96}

Cleaned WMATA News Data

Cleaned BART News Data

Ridership by Hour

These data sets can be used to compare ridership trends before and after the pandemic and the ramifications that came with it. Both show average daily entries and exits in Washington, D.C. by hour of the day, letting us see when people use public transit and, ultimately, infer why they may be using it. The two sets differ only in coverage: the before set spans January 1, 2018 to March 17, 2020, while the after set spans March 18, 2020 to October 5, 2023.

The steps for cleaning these data sets are as follows:

  • Read .csv files and remove rounded fields, as they are duplicative
  • Rename columns for readability
  • Convert numeric columns to numeric data type
  • Introduce hour_numeric column for future time series analysis
  • Rearrange columns
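
The hour_numeric column follows the transit service day rather than the calendar day, so it wraps past midnight; a minimal sketch of that ordering (taken from the c(4:23, 0:3) assignment in the code):

```python
# Rows in the boardings files run from 4 AM through 11 PM, then midnight
# to 3 AM, so the numeric hour sequence wraps around the clock.
hour_numeric = list(range(4, 24)) + list(range(0, 4))

print(hour_numeric[:5], hour_numeric[-4:])  # [4, 5, 6, 7, 8] [0, 1, 2, 3]
```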

The code for carrying this out and screenshots of the cleaned before data set are below. Additionally, plots of the data sets have been charted to visualize the data that is being obtained.

Code
library(tidyverse)
library(tidyr)

before <- read.csv("../data/WMATA_boardings_by_hour/boardings_pre-covid.csv")
after <- read.csv("../data/WMATA_boardings_by_hour/boardings_post-covid.csv")
before <- before[c(1,2,4)]
after <- after[c(1,2,4)]
colnames(before) <- c("hour", "avg_daily_entries", "avg_daily_exits")
colnames(after) <- c("hour", "avg_daily_entries", "avg_daily_exits")
before$avg_daily_entries <- as.numeric(gsub(",","", before$avg_daily_entries))
before$avg_daily_exits <- as.numeric(gsub(",","", before$avg_daily_exits))
after$avg_daily_entries <- as.numeric(gsub(",","", after$avg_daily_entries))
after$avg_daily_exits <- as.numeric(gsub(",","", after$avg_daily_exits))
before$hour_numeric <- c(4:23, 0:3)  # rows run 4 AM through 11 PM, then midnight to 3 AM
after$hour_numeric <- c(4:23, 0:3)
before <- before[c(1,4,2,3)]
after <- after[c(1,4,2,3)]
ggplot(data=before, aes(x=factor(hour_numeric, ordered = FALSE), y=avg_daily_entries, group=1)) +
  geom_line()+
  geom_point()+
  labs(x = "Numeric Hour of Day", y = "Average Daily Entries", title = "Average Daily Entries by Hour (Pre-Pandemic)")
ggplot(data=after, aes(x=factor(hour_numeric, ordered = FALSE), y=avg_daily_entries, group=1)) +
  geom_line()+
  geom_point()+
  labs(x = "Numeric Hour of Day", y = "Average Daily Entries", title = "Average Daily Entries by Hour (Post-Pandemic)")
head(before)
head(after)
write.csv(before, "../data/cleaned_data/hourly_average_cleaned_pre-covid.csv")
write.csv(after, "../data/cleaned_data/hourly_average_cleaned_post-covid.csv")

Hourly Ridership from 1/1/2018 to 3/17/2020 - Cleaned

Ridership by Demographic 6

Lastly, the purpose of this data is to see the rates at which demographic groups use different modes of transportation to commute to work. The raw data set contains all demographic differentiators in a single table, which makes it untidy data, so it must be split into several tables, one for each demographic type. The columns denoting percent error are useful for understanding the data but would be cumbersome for EDA, so we focus only on the proportions given in the data columns.

Cleaning this data set lets us use R to clean qualitative as well as quantitative variables. The steps for carrying this out are:

  • Read full .csv file and rename columns for readability based on glossary given by the data source
  • Select only rows that split records by age, and only columns that contain data points
  • Trim leading spaces from age column
  • Remove percentage symbol from numeric fields and convert them to numeric data type
  • Repeat process for sex, race, citizenship status, and earnings
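
Steps 3 and 4 (trimming leading spaces and stripping the percentage symbol) look like this in pandas, using toy values standing in for the ACS table:

```python
import pandas as pd

# Toy stand-in for one demographic slice of the ACS table.
age = pd.DataFrame({
    "age_group": ["  16 to 19 years", "  20 to 24 years"],
    "public_transit": ["2.1%", "3.4%"],
})

age["age_group"] = age["age_group"].str.lstrip()  # trim leading spaces
# Drop the trailing "%" and convert the proportion to a float.
age["public_transit"] = age["public_transit"].str.rstrip("%").astype(float)

print(age["public_transit"].tolist())  # [2.1, 3.4]
```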

Below is the code, and a sample screenshot from the earnings cleaned data set.

Code
library(tidyverse)
library(tidyr)
library(stringr)

demographics <- read.csv("../data/ridership_by_demographic_2021.csv")
colnames(demographics)[c(2,4,6,8)] <- c("total", "drive_alone", "carpool", "public_transit")

# Helper: strip the trailing percentage symbol and convert to numeric
pct_to_num <- function(x) as.numeric(substr(x, 1, nchar(x) - 1))
data_cols <- c("total", "drive_alone", "carpool", "public_transit")

age <- demographics[c(3:8), c(1,2,4,6,8)]
colnames(age)[1] <- "age_group"
age$age_group <- str_trim(age$age_group, "left")
age[data_cols] <- lapply(age[data_cols], pct_to_num)
head(age)
write.csv(age, "../data/cleaned_data/ridership_age.csv")

sex <- demographics[c(11:12), c(1,2,4,6,8)]
colnames(sex)[1] <- "sex"
sex$sex <- str_trim(sex$sex, "left")
sex[data_cols] <- lapply(sex[data_cols], pct_to_num)
head(sex)
write.csv(sex, "../data/cleaned_data/ridership_sex.csv")

citizenship <- demographics[c(25:28), c(1,2,4,6,8)]
colnames(citizenship)[1] <- "status"
citizenship$status <- str_trim(citizenship$status, "left")
citizenship[data_cols] <- lapply(citizenship[data_cols], pct_to_num)
head(citizenship)
write.csv(citizenship, "../data/cleaned_data/ridership_citizenship.csv")

earnings <- demographics[c(35:42), c(1,2,4,6,8)]
colnames(earnings)[1] <- "range"
earnings$range <- str_trim(earnings$range, "left")
earnings[data_cols] <- lapply(earnings[data_cols], pct_to_num)
head(earnings)
write.csv(earnings, "../data/cleaned_data/ridership_earnings.csv")

Transportation Methods by Earnings - Cleaned

Commute to Work by Demographic 7

The main objective of cleaning this data is to narrow the fields down by removing superfluous columns, and to decode the numerical codes that the data set uses in place of categorical values. To do this, we reference the glossary that accompanies the data set. The steps are the following:

  • Remove columns that provide excess detail
  • Rename columns
  • Remove columns that will not be necessary for Naive Bayes classification
  • Replace codes for sex, marital_status, race, hispanic, employment, metropolitan_status, and transportation_type
    • Codes for metropolitan_status and transportation_type are aggregated to simplify data (e.g., all public transit types are labeled public_transit)
  • Set age and personal_income to numerical data types
  • Set all values where personal income is 0 and the person is not in the labor force to NA
  • Drop all rows where transportation_type is NA, as those are not labeled
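
The code-to-label decoding amounts to a lookup table; a minimal pandas sketch of the same idea, using two of the mappings from the IPUMS glossary:

```python
import pandas as pd

commute = pd.DataFrame({"sex": [1, 2, 1], "employment": [1, 3, 2]})

sex_labels = {1: "Male", 2: "Female"}
employment_labels = {1: "Employed", 2: "Unemployed", 3: "Not_labor_force"}

# Series.map swaps each code for its label; codes absent from the lookup
# become NaN, matching the NA handling in the cleaning steps.
commute["sex"] = commute["sex"].map(sex_labels)
commute["employment"] = commute["employment"].map(employment_labels)

print(commute["sex"].tolist())  # ['Male', 'Female', 'Male']
```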

The code and output are shown below:

Code
library(tidyverse)
library(tidyr)

commute <- read.csv("../data/IPUMS_commute.csv")
unique(commute$YEAR)
commute <- commute[-c(2,4:10,15,17,21,24)]
colnames(commute) <- c('year','id','sex','age','marital_status','race','hispanic','citizenship','english',
                       'employment','labor_force','worker_class', 'personal_income', 'family_income',
                       'state','county','metropolitan_status','transportation_type')
commute <- commute[,!(names(commute) %in% c('citizenship','english','labor_force','worker_class','family_income','state','county'))]
commute$sex <- replace(commute$sex, commute$sex=='1', 'Male')
commute$sex <- replace(commute$sex, commute$sex=='2', 'Female')
commute$age <- as.integer(commute$age)
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='1', 'Married_present')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='2', 'Married_absent')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='3', 'Separated')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='4', 'Divorced')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='5', 'Widowed')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='6', 'Never_married')
commute$race <- replace(commute$race, commute$race=='1', 'White')
commute$race <- replace(commute$race, commute$race=='2', 'Black')
commute$race <- replace(commute$race, commute$race=='3', 'American_Indian')
commute$race <- replace(commute$race, commute$race=='4', 'Chinese')
commute$race <- replace(commute$race, commute$race=='5', 'Japanese')
commute$race <- replace(commute$race, commute$race=='6', 'Other_Asian_PI')
commute$race <- replace(commute$race, commute$race=='7', 'Other_race')
commute$race <- replace(commute$race, commute$race=='8', 'Two_races')
commute$race <- replace(commute$race, commute$race=='9', 'Three_or_more_races')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='0', 'Not_Hispanic')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='1', 'Mexican')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='2', 'Puerto_Rican')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='3', 'Cuban')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='4', 'Other')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='9', 'Not_reported')
commute$employment <- replace(commute$employment, commute$employment=='0', NA)
commute$employment <- replace(commute$employment, commute$employment=='1', 'Employed')
commute$employment <- replace(commute$employment, commute$employment=='2', 'Unemployed')
commute$employment <- replace(commute$employment, commute$employment=='3', 'Not_labor_force')
commute$personal_income <- as.integer(commute$personal_income)
commute$personal_income <- replace(commute$personal_income,commute$employment=='Not_labor_force' & commute$personal_income==0, NA)
commute$metropolitan_status <- replace(commute$metropolitan_status, commute$metropolitan_status=='0', NA)
commute$metropolitan_status <- replace(commute$metropolitan_status, commute$metropolitan_status %in% c('1','2','3','4','5'), 'Metro_area')
commute$metropolitan_status <- replace(commute$metropolitan_status, commute$metropolitan_status %in% c('6','7','8'), 'Not_metro_area')
commute$metropolitan_status <- replace(commute$metropolitan_status, commute$metropolitan_status=='9', NA)
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='0', NA)
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type %in% c('10','11','12','13','14','15','20'), 'Private_vehicle')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type %in% c('31','32','33','34','35','36','37','38','39'), 'Public_transit')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='50', 'Bicycle')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='60', 'Walk')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='70', 'Other')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='80', 'Work_from_home')
commute <- commute %>% drop_na(transportation_type)

write.csv(commute, "../data/cleaned_data/commute_by_demographic.csv")

Commute by Demographic - Cleaned

WMATA and BART Yelp Reviews 8 9

The purpose of cleaning this data is to enable Naive Bayes classification in the future, as labeled text data can be valuable for analyzing how people express their opinions on public transit systems. The raw data contains duplicates on each page, which must be dealt with, and several fields need their data types corrected. The steps for this are:

  • Remove excess columns
  • Remove rows where review is duplicated (date is NA in these records, so we drop based on that)
  • Change column names
  • Take just the numerical rating and set to integer type
  • Set Date field to date type
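
The date conversion in the last step can be done with one vectorized call; a minimal sketch on toy review dates:

```python
import pandas as pd

reviews = pd.DataFrame({"Date": ["Jul 26, 2023", "Oct 27, 2023"]})

# pd.to_datetime parses the whole column at once; assigning inside a
# `for i in reviews['Date']` loop would only rebind the loop variable
# and leave the column unchanged.
reviews["Date"] = pd.to_datetime(reviews["Date"], format="%b %d, %Y")

print(reviews["Date"].dt.year.tolist())  # [2023, 2023]
```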

The code and output are below:

Code
from datetime import datetime
Code
wmata_yelp = pd.read_csv('../data/yelp_reviews/wmata_reviews.csv')
wmata_yelp = wmata_yelp.drop(columns='Unnamed: 0')
wmata_yelp = wmata_yelp[wmata_yelp['1'].notna()]
wmata_yelp = wmata_yelp.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
wmata_yelp['Rating'] = wmata_yelp['Rating'].str[0].astype(int)
wmata_yelp['Date'] = pd.to_datetime(wmata_yelp['Date'], format="%b %d, %Y")  # convert Date strings to datetime
wmata_yelp.to_csv('../data/cleaned_data/wmata_reviews_cleaned.csv', index=False)
wmata_yelp.head(10)
Rating Date Review
2 5 Jul 26, 2023 I had to compliment WMATA on the shuttle servi...
3 1 Oct 27, 2023 Since when did metro close the doors to their ...
4 1 Sep 29, 2023 So many things wrong with wmata I can't even b...
5 5 Sep 23, 2023 I WFH, and WMATA has been very helpful in my m...
6 4 Sep 18, 2022 I took the metro while visiting DC. I began at...
7 1 Sep 11, 2023 Less than zero. Some bureaucrat decided to rep...
8 3 Jul 27, 2022 The Georgia Avenue Limited Line is the Bus 79 ...
9 3 Jun 22, 2022 Senior SmarTrip card costs $2 and provides $1 ...
10 3 Jul 11, 2021 Beauty is in the eye of the beholder.Being a n...
11 1 Aug 20, 2023 The man who works at the Judiciary Square stop...
Code
bart_yelp = pd.read_csv('../data/yelp_reviews/bart_reviews.csv')
bart_yelp = bart_yelp.drop(columns='Unnamed: 0')
bart_yelp = bart_yelp[bart_yelp['1'].notna()]
bart_yelp = bart_yelp.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
bart_yelp['Rating'] = bart_yelp['Rating'].str[0].astype(int)
bart_yelp['Date'] = pd.to_datetime(bart_yelp['Date'], format="%b %d, %Y")  # convert Date strings to datetime
bart_yelp.to_csv('../data/cleaned_data/bart_reviews_cleaned.csv', index=False)
bart_yelp.head(10)
Rating Date Review
2 3 Oct 2, 2023 Bart is convenient but not intuitive. The line...
3 1 Aug 3, 2023 I'm sorry guys..... You failed in a most embar...
4 2 Jul 27, 2023 I know what crack smells like--not by choice. ...
5 5 Apr 20, 2023 Okay, not perfect but such a fast easy way to ...
6 1 Nov 7, 2023 Can someone please tell me wtf is that horribl...
7 4 Feb 5, 2023 Bart in General is a great way to get around t...
8 1 Jul 6, 2023 Disgusting, and very loud. Never been on a tra...
9 3 Jan 3, 2023 There is no greater way to keep tabs on the pu...
10 3 Nov 18, 2022 I judge a city's public transportation chiefly...
11 1 Jul 29, 2023 As a visitor, I found Bart a combination of co...
Code
mta_yelp = pd.read_csv('../data/yelp_reviews/mta_reviews.csv')
mta_yelp = mta_yelp.drop(columns='Unnamed: 0')
mta_yelp = mta_yelp[mta_yelp['1'].notna()]
mta_yelp = mta_yelp.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
mta_yelp['Rating'] = mta_yelp['Rating'].str[0].astype(int)
mta_yelp['Date'] = pd.to_datetime(mta_yelp['Date'], format="%b %d, %Y")  # convert Date strings to datetime
mta_yelp.to_csv('../data/cleaned_data/mta_reviews_cleaned.csv', index=False)
Code
la_yelp = pd.read_csv('../data/yelp_reviews/la_reviews.csv')
la_yelp = la_yelp.drop(columns='Unnamed: 0')
la_yelp = la_yelp[la_yelp['1'].notna()]
la_yelp = la_yelp.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
la_yelp['Rating'] = la_yelp['Rating'].str[0].astype(int)
la_yelp['Date'] = pd.to_datetime(la_yelp['Date'], format="%b %d, %Y")  # convert Date strings to datetime
la_yelp.to_csv('../data/cleaned_data/la_reviews_cleaned.csv', index=False)
Code
cta_yelp = pd.read_csv('../data/yelp_reviews/cta_reviews.csv')
cta_yelp = cta_yelp.drop(columns='Unnamed: 0')
cta_yelp = cta_yelp[cta_yelp['1'].notna()]
cta_yelp = cta_yelp.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
cta_yelp['Rating'] = cta_yelp['Rating'].str[0].astype(int)
cta_yelp['Date'] = pd.to_datetime(cta_yelp['Date'], format="%b %d, %Y")  # convert Date strings to datetime
cta_yelp.to_csv('../data/cleaned_data/cta_reviews_cleaned.csv', index=False)
Code
septa_yelp = pd.read_csv('../data/yelp_reviews/septa_reviews.csv')
septa_yelp = septa_yelp.drop(columns='Unnamed: 0')
septa_yelp = septa_yelp[septa_yelp['1'].notna()]
septa_yelp = septa_yelp.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
septa_yelp['Rating'] = septa_yelp['Rating'].str[0].astype(int)
septa_yelp['Date'] = pd.to_datetime(septa_yelp['Date'], format="%b %d, %Y")  # convert Date strings to datetime
septa_yelp.to_csv('../data/cleaned_data/septa_reviews_cleaned.csv', index=False)
Code
mbta_yelp = pd.read_csv('../data/yelp_reviews/mbta_reviews.csv')
mbta_yelp = mbta_yelp.drop(columns='Unnamed: 0')
mbta_yelp = mbta_yelp[mbta_yelp['1'].notna()]
mbta_yelp = mbta_yelp.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
mbta_yelp['Rating'] = mbta_yelp['Rating'].str[0].astype(int)
mbta_yelp['Date'] = pd.to_datetime(mbta_yelp['Date'], format="%b %d, %Y")  # convert Date strings to datetime
mbta_yelp.to_csv('../data/cleaned_data/mbta_reviews_cleaned.csv', index=False)
Code
total_yelp = pd.concat([wmata_yelp,bart_yelp,mta_yelp,la_yelp,cta_yelp,septa_yelp,mbta_yelp])
print(total_yelp.shape)
print(total_yelp.columns)
total_yelp.to_csv('../data/cleaned_data/total_yelp.csv')
(2434, 3)
Index(['Rating', 'Date', 'Review'], dtype='object')

APTA

This APTA spreadsheet lists ridership and cost statistics by transit agency. To compare cities, agencies are grouped by urbanized area (UZA), keeping only active agencies with a most recent report year of 2022; each area's cost per trip, fare per trip, and trip length are then averaged across its agencies, weighted by unlinked passenger trips.

Code
library(tidyverse)
library(tidyr)
library(readxl)

cities <- read_excel("../data/apta-cities_9-23.xlsx",sheet = 2)
cities <- cities[-c(2,4:6,8:12,16:17,19:20)]
cities <- cities[(cities$Status %in% "Active" & cities$`Most Recent Report Year` %in% 2022),]
unique(cities$Status)
head(cities)
nrows <- length(unique(cities$`UZA Name`))
cities2 <- data.frame(City=character(nrows), Population=numeric(nrows), Area=numeric(nrows),
                      Cost_per_trip=numeric(nrows), Fare_per_trip=numeric(nrows), Miles_per_trip=numeric(nrows))
cities2$City <- unique(cities$`UZA Name`)
# Ridership-weighted average across the agencies serving one urbanized area
weighted_avg <- function(values, weights) sum(values * weights) / sum(weights)
for (i in 1:nrows) {
  sub <- cities[cities$`UZA Name` == cities2[i, 1], ]
  cities2[i, 2] <- as.numeric(unique(sub$`UZA Population`)[1])
  cities2[i, 3] <- round(as.numeric(unique(sub$`UZA SQ Miles`)[1]), digits = 2)
  cities2[i, 4] <- weighted_avg(sub$`Avg Cost Per Trip FY`, sub$`Unlinked Passenger Trips FY`)
  cities2[i, 5] <- weighted_avg(sub$`Avg Fares Per Trip FY`, sub$`Unlinked Passenger Trips FY`)
  cities2[i, 6] <- weighted_avg(sub$`Avg Trip Length FY`, sub$`Unlinked Passenger Trips FY`)
}
head(cities2)
write.csv(cities2, "../data/cleaned_data/apta_cities_cleaned.csv")
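
The per-city cost, fare, and trip-length figures are averages across each area's agencies, weighted by unlinked passenger trips; as a standalone numeric sketch with made-up agency figures:

```python
# Two hypothetical agencies serving the same urbanized area.
cost_per_trip = [4.00, 10.00]   # average cost per unlinked trip, by agency
trips = [9_000_000, 1_000_000]  # unlinked passenger trips, by agency

# Ridership-weighted average: sum(cost_i * trips_i) / sum(trips_i)
weighted_cost = sum(c * t for c, t in zip(cost_per_trip, trips)) / sum(trips)
print(weighted_cost)  # 4.6
```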

Footnotes

  1. “Ridership Report.” American Public Transportation Association, 21 Sept. 2023, www.apta.com/research-technical-resources/transit-statistics/ridership-report/.↩︎

  2. “News API – Search News and Blog Articles on the Web.” News API – Search News and Blog Articles on the Web, newsapi.org/. Accessed 12 Oct. 2023.↩︎

  3. Barrero, Jose Maria, et al. Why Working from Home Will Stick, 2021, https://doi.org/10.3386/w28731.↩︎

  4. “Washington Metropolitan Area Transit Authority.” WMATA, www.wmata.com/initiatives/ridership-portal/. Accessed 12 Oct. 2023.↩︎

  5. “Ridership Reports.” Ridership Reports | Bay Area Rapid Transit, www.bart.gov/about/reports/ridership. Accessed 13 Oct. 2023.↩︎

  6. U.S. Census Bureau. “MEANS OF TRANSPORTATION TO WORK BY SELECTED CHARACTERISTICS.” American Community Survey, ACS 5-Year Estimates Subject Tables, Table S0802, 2021, https://data.census.gov/table/ACSST5Y2021.S0802?t=Commuting&g=860XX00US20020,20032. Accessed on October 12, 2023.↩︎

  7. Steven Ruggles, Sarah Flood, Matthew Sobek, Danika Brockman, Grace Cooper, Stephanie Richards, and Megan Schouweiler. IPUMS USA: Version 13.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D010.V13.0↩︎

  8. “WMATA - Washington, DC, DC,” Yelp, https://www.yelp.com/biz/wmata-washington (accessed Nov. 2, 2023).↩︎

  9. “Bart - Bay Area Rapid Transit - Oakland, CA,” Yelp, https://www.yelp.com/biz/bart-bay-area-rapid-transit-oakland-2 (accessed Nov. 2, 2023).↩︎